
[YUNIKORN-2068] E2E Test for Preemption #705

Closed
wants to merge 27 commits

Conversation

Contributor

@rrajesh-cloudera commented Oct 26, 2023

Added a new test case to cover preemption scenarios on a specific node.

What type of PR is it?

  - Bug Fix
  - Improvement
  - Feature
  - Documentation
  - Hot Fix
  - Refactoring

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2069


codecov bot commented Oct 27, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (44d8c38) 71.65% compared to head (461cff1) 69.52%.
Report is 24 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #705      +/-   ##
==========================================
- Coverage   71.65%   69.52%   -2.14%     
==========================================
  Files          50       50              
  Lines        7953     7993      +40     
==========================================
- Hits         5699     5557     -142     
- Misses       2056     2248     +192     
+ Partials      198      188      -10     


@wilfred-s requested a review from manirajv06 October 30, 2023 07:55
@pbacsko self-requested a review November 2, 2023 16:03
Contributor

@pbacsko left a comment

@rrajesh-cloudera the test case failed during the latest run:

2023-11-07T05:22:50.5444473Z Preemption Verify_preemption_on_specific_node
2023-11-07T05:22:50.5446062Z /home/runner/work/yunikorn-k8shim/yunikorn-k8shim/test/e2e/preemption/preemption_test.go:553
2023-11-07T05:22:50.5447933Z   STEP: Create Two Queue High and Low Guaranteed Limit @ 11/07/23 05:22:50.544
2023-11-07T05:22:50.5455737Z   STEP: Port-forward the scheduler pod @ 11/07/23 05:22:50.545
2023-11-07T05:22:50.5459264Z port-forward is already running  STEP: Enabling new scheduling config @ 11/07/23 05:22:50.545
2023-11-07T05:22:53.5573043Z   STEP: Schedule a number of small, Low priority pause tasks on Low Guaranteed queue (Enough to fill the node) @ 11/07/23 05:22:53.556
2023-11-07T05:22:53.5574921Z   STEP: Deploy the sleep pod sleepjob1 to the development namespace @ 11/07/23 05:22:53.557
2023-11-07T05:23:53.7602458Z   [FAILED] in [It] - /home/runner/work/yunikorn-k8shim/yunikorn-k8shim/test/e2e/preemption/preemption_test.go:603 @ 11/07/23 05:23:53.759
2023-11-07T05:23:53.7604096Z   Unexpected error:
2023-11-07T05:23:53.7613907Z       <context.deadlineExceededError>:
2023-11-07T05:23:53.7614589Z       context deadline exceeded
2023-11-07T05:23:53.7615095Z       {}

Contributor

pbacsko commented Nov 9, 2023

@rrajesh-cloudera you can avoid linter problems by running "make lint" locally.

Contributor

@pbacsko left a comment

Some comments; one of them is major. Double-check the output of Yunikorn: we don't want to trigger the DaemonSet-specific preemption logic.

}
}
Ω(sandbox1RunningPodsCnt).To(gomega.Equal(2), "One of the pods in root.sandbox1 should be preempted")
errNew := kClient.DeletePods(newNamespace.Name)
Contributor

Already performed inside AfterEach()

Contributor Author

AfterEach does cleanup for the global namespace that is created for all the tests. In this test I create a new namespace and clean it up at the end of the test function to make sure no resources are left behind. I tried the global namespace and ran into some issues, so I create the namespace inside the test script only and clean it up once execution is done.

Contributor

If you need a separate namespace for each test case, that should be done in BeforeEach/AfterEach. If something fails before this code, it will never be cleaned up.

Example:

ginkgo.BeforeEach(func() {
	kubeClient = k8s.KubeCtl{}
	gomega.Expect(kubeClient.SetClient()).To(gomega.BeNil())
	ns = "ns-" + common.RandSeq(10)
	ginkgo.By(fmt.Sprintf("Creating namespace: %s for admission controller tests", ns))
	var ns1, err1 = kubeClient.CreateNamespace(ns, nil)
	gomega.Ω(err1).NotTo(gomega.HaveOccurred())
	gomega.Ω(ns1.Status.Phase).To(gomega.Equal(v1.NamespaceActive))
})

...

ginkgo.AfterEach(func() {
	ginkgo.By("Tear down namespace: " + ns)
	err := kubeClient.TearDownNamespace(ns)
	gomega.Ω(err).NotTo(gomega.HaveOccurred())
	// call the healthCheck api to check scheduler health
	ginkgo.By("Check YuniKorn's health")
	checks, err2 := yunikorn.GetFailedHealthChecks()
	gomega.Ω(err2).ShouldNot(gomega.HaveOccurred())
	gomega.Ω(checks).Should(gomega.Equal(""), checks)
})

6. This should trigger preemption on low-priority queue and remove or preempt task from low priority queue
7. Do cleanup once test is done either passed or failed
*/
time.Sleep(20 * time.Second)
Contributor

Why do we start with a 20-second sleep? It looks very arbitrary.

Contributor Author

The cleanup for the other tests takes time, and because of that the test I added was failing intermittently on a few k8s versions. I added a sleep before the test executes so that it only starts once the cleanup is done and resources are available to allocate.

Contributor

That's interesting. I think the reason is that the tests are not using different namespaces. See my comment below: you need to create/destroy a namespace for each test. This is what solved many problems for the admission controller tests; they often interfered with each other and debugging was difficult. A new namespace for every test solved it.

Contributor Author

Hi Peter, I checked your comments and tried that approach, but it requires a lot of changes: since we are using a common namespace for all the tests, each test would need to be updated as well. I am also facing some issues with that approach.

Contributor

Why does it need a lot of changes? That's what we're doing everywhere. Tests should be isolated by different namespaces. This 20-second sleep is only acceptable as a workaround. I'll try to see how difficult that is, but at least this must be addressed in a follow-up JIRA. This cannot stay as is.
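For illustration, a bounded poll is usually less fragile than a fixed sleep while the namespace-per-test change is pending. A minimal sketch, assuming a hypothetical podsRemaining helper that counts pods still left in the shared dev namespace (neither the helper nor the variable names below exist in this repo):

// Sketch only: wait up to 2 minutes, polling every 5 seconds, for leftovers
// from earlier tests to be cleaned up instead of sleeping a fixed 20 seconds.
// podsRemaining and dev are assumed names, not existing identifiers.
gomega.Eventually(func() int {
	return podsRemaining(dev)
}, 120*time.Second, 5*time.Second).Should(gomega.BeZero(),
	"previous test resources should be cleaned up before this test starts")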

Contributor

OK, I can see what the problem is. I'll create a JIRA to separate the test cases. It's unacceptable to have all tests in a single namespace. They can easily interfere with each other.

https://issues.apache.org/jira/browse/YUNIKORN-2247

@@ -572,3 +656,11 @@ func createSandbox1SleepPodCofigs(cnt, time int) []k8s.SleepPodConfig {
}
return sandbox1Configs
}

func createSandbox1SleepPodCofigsWithStaticNode(cnt, time int) []k8s.SleepPodConfig {
Contributor

"WithRequiredNode" or "WithNodeSelector" sounds better

Contributor Author

Acknowledged.

func createSandbox1SleepPodCofigsWithStaticNode(cnt, time int) []k8s.SleepPodConfig {
sandbox1Configs := make([]k8s.SleepPodConfig, 0, cnt)
for i := 0; i < cnt; i++ {
sandbox1Configs = append(sandbox1Configs, k8s.SleepPodConfig{Name: fmt.Sprintf("sleepjob%d", i+1), NS: devNew, Mem: sleepPodMemLimit2, Time: time, Optedout: k8s.Allow, Labels: map[string]string{"queue": "root.sandbox1"}, RequiredNode: nodeName})
Contributor

This part is very suspicious. You're using the "RequiredNode" setting here, which will trigger the "simple" (aka. RequiredNode) preemption logic inside Yunikorn. That is not the generic preemption logic that Craig worked on.

Please check if the console log of Yunikorn contains this:

log.Log(log.SchedApplication).Info("Triggering preemption process for daemon set ask",
		zap.String("ds allocation key", ask.GetAllocationKey()))

If this is the case (which is VERY likely), changes are necessary.

A trivial solution is to enhance the RequiredNode field; it has to become a struct like:

type RequiredNode struct {
  Node      string
  DaemonSet bool
}

type SleepPodConfig struct {
...
	Mem          int64
	RequiredNode RequiredNode 
	Optedout     AllowPreemptOpted
}

If DaemonSet == false, then this code doesn't run:

owner := metav1.OwnerReference{APIVersion: "v1", Kind: constants.DaemonSetType, Name: "daemonset job", UID: "daemonset"}
owners = []metav1.OwnerReference{owner}

So this line effectively becomes:

sandbox1Configs = append(sandbox1Configs,
	k8s.SleepPodConfig{
		Name:         fmt.Sprintf("sleepjob%d", i+1),
		NS:           devNew,
		Mem:          sleepPodMemLimit2,
		Time:         time,
		Optedout:     k8s.Allow,
		Labels:       map[string]string{"queue": "root.sandbox1"},
		RequiredNode: k8s.RequiredNode{Node: nodeName, DaemonSet: false},
	},
)

The existing tests inside simple_preemptor_test.go are affected.
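For illustration, the guard described above could look roughly like this inside the sleep-pod builder. This is a sketch under the assumption that SleepPodConfig carries the new RequiredNode struct; conf and the surrounding helper are illustrative names, not existing code:

// Sketch: only attach the DaemonSet owner reference when explicitly requested,
// so a plain required-node pod does not take the DaemonSet preemption path.
var owners []metav1.OwnerReference
if conf.RequiredNode.Node != "" && conf.RequiredNode.DaemonSet {
	owner := metav1.OwnerReference{
		APIVersion: "v1",
		Kind:       constants.DaemonSetType,
		Name:       "daemonset job",
		UID:        "daemonset",
	}
	owners = []metav1.OwnerReference{owner}
}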

Contributor

@pbacsko Dec 5, 2023

Have you looked at this? Again, the tests should address YUNIKORN-2068. RequiredNodePreemptor does not use the predicates.

@rrajesh-cloudera
Contributor Author

Hi @pbacsko, @craigcondit, @manirajv06, @FrankYang0529, please review the PR and provide your inputs.

Contributor

pbacsko commented Dec 21, 2023

Hi @pbacsko, @craigcondit, @manirajv06, @FrankYang0529, please review the PR and provide your inputs.

@rrajesh-cloudera as I mentioned in my comment (#705 (comment)), I don't think this test properly verifies what it is intended to. At this point I'm not even sure we need an e2e test, because we're trying to replicate a scenario which causes (or used to cause) a race condition. A smoke test using MockScheduler might actually be a better solution here.

We can make a decision after everyone is back from vacation.

Contributor

pbacsko commented Dec 21, 2023

@rrajesh-cloudera please follow up on my comment and check whether the text "Triggering preemption process for daemon set ask" is present in the YK logs. I assume you can run the test locally in isolation.
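For illustration, that log check could even be asserted from the e2e suite with client-go. A minimal sketch; the namespace, label selector, and single-container assumption for the scheduler pod are guesses, not values taken from this repo:

// Sketch: fetch the scheduler pod's logs and fail if the DaemonSet-specific
// preemption path was taken. Namespace and label selector below are assumptions.
pods, err := clientSet.CoreV1().Pods("yunikorn").List(context.TODO(),
	metav1.ListOptions{LabelSelector: "app=yunikorn"})
gomega.Ω(err).NotTo(gomega.HaveOccurred())
gomega.Ω(pods.Items).NotTo(gomega.BeEmpty())

raw, err := clientSet.CoreV1().Pods("yunikorn").
	GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{}).
	DoRaw(context.TODO())
gomega.Ω(err).NotTo(gomega.HaveOccurred())
gomega.Ω(string(raw)).NotTo(
	gomega.ContainSubstring("Triggering preemption process for daemon set ask"),
	"generic preemption, not the DaemonSet-specific path, should have been used")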
